Squall: Scalable Real-time Analytics using Efficient, Skew-resilient Join Operators

Author

  • Aleksandar VITOROVIĆ
Abstract

Squall is a scalable online query engine that runs complex analytics in a cluster using skew-resilient, adaptive operators. Online processing means that results are built incrementally as the input arrives, and it is ubiquitous in applications such as algorithmic trading, clickstream analysis and business intelligence (e.g., in order to reach a potential customer during the active session). This thesis presents an overview of Squall, including some novel join operators, as well as lessons learned over five years of working on this system. Existing open-source online systems (e.g. Twitter Storm, Spark Streaming) provide only hash joins, which are limited to equi-joins and prone to skew. In contrast, Squall puts together state-of-the-art skew-resilient partitioning schemes (including some of our own), local query operators, and techniques for scalable online query processing. Such a system allows us to study the effect of various design choices on performance, to seamlessly build efficient novel operators, and to discover and address new skew types (e.g. dependence on tuple arrival order) that can arise only in online systems. Existing partitioning schemes for joins work well only for a narrow set of data distribution properties, that is, a specific proportion of join output and input sizes for 2-way joins, or a similar data distribution among all the relations for multi-way joins. In contrast, Squall covers the entire spectrum of data distributions by providing two novel skew-resilient partitioning schemes: (a) a scheme for 2-way non-equi joins that partitions the data using a multistage load-balancing algorithm containing a join-specialized computational geometry algorithm, and (b) a scheme for multi-way joins that constructs a composite partitioning, consisting of different partitioning schemes chosen according to the degree of skew in different relation attributes.
Compared to the state of the art, our schemes achieve up to 15× speedup and are up to 5× more efficient in terms of resource consumption.
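
The abstract's claim that hash joins are prone to skew can be illustrated with a minimal sketch. It is not Squall's code: the function names, the 4-worker setup and the synthetic key distribution are all assumptions made for illustration. It routes tuples to workers by join key (as a hash join does) and compares the resulting load with a content-insensitive round-robin assignment.

```python
from collections import Counter

def hash_partition_loads(keys, num_workers):
    """Tuples per worker when each tuple is routed by its join key.

    Integer keys are reduced modulo the worker count, a stand-in for a
    hash function (hypothetical routing, chosen to be deterministic).
    """
    loads = Counter(key % num_workers for key in keys)
    return [loads.get(w, 0) for w in range(num_workers)]

def round_robin_loads(keys, num_workers):
    """Tuples per worker when routing ignores the key entirely."""
    loads = Counter(i % num_workers for i in range(len(keys)))
    return [loads.get(w, 0) for w in range(num_workers)]

def imbalance(loads):
    """Load of the busiest worker divided by the average load."""
    return max(loads) / (sum(loads) / len(loads))

# Synthetic skewed input (an assumption): key 7 appears 600 times,
# keys 0..399 appear once each, 1000 tuples in total.
keys = [7] * 600 + list(range(400))

hash_loads = hash_partition_loads(keys, 4)
rr_loads = round_robin_loads(keys, 4)

print(hash_loads, imbalance(hash_loads))
print(rr_loads, imbalance(rr_loads))
```

With key-based routing, every tuple carrying the hot key lands on the same worker, so that worker becomes the bottleneck. Content-insensitive routing balances the load, but a join then has to send each tuple of the other relation to several workers, which is the kind of trade-off skew-resilient partitioning schemes are designed to manage.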

Similar resources

Squall: Scalable Real-time Analytics

Squall is a scalable online query engine that runs complex analytics in a cluster using skew-resilient, adaptive operators. Squall builds on state-of-the-art partitioning schemes and local algorithms, including some of our own. This paper presents the overview of Squall, including some novel join operators. The paper also presents lessons learned over the five years of working on this system, a...

Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning

Shared-nothing multiprocessor architecture is known to be more scalable to support very large databases. Compared to other join strategies, a hash-based join algorithm is particularly efficient and easily parallelized for this computation model. However, this hardware structure is very sensitive to the data skew problem. Unless the parallel hash join algorithm includes some load balancing mec...

Efficient Large Outer Joins over MapReduce

Big Data analytics largely rely on being able to execute large joins efficiently. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially on the extremely popular MapReduce platform. In this paper, we studied several current algorithms/techniques used in large outer joins. We f...

V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors

This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage computes and joins the partial results, and th...

An Optimal Skew-insensitive Join and Multi-join Algorithm for Distributed Architectures

The development of scalable parallel database systems requires the design of efficient algorithms for the join operation, which is the most frequent and expensive operation in relational database systems. The join is also the operation most vulnerable to data skew and to the high cost of communication in distributed architectures. In this paper, we present a new parallel algorithm for join and m...

Publication date: 2016